The aim of my investigation is to see which variables are related to volatile acidity, which in turn may affect the quality of white wine. The dataset consists of 4898 observations of 12 variables, plus a row-index column X.
# Load all of the packages
library(ggplot2)
library(reshape)
library(corrplot)
# Load the Data
wine <- read.csv('wineQualityWhites.csv')
The dataset description is shown below. We created a new variable called bound sulfur dioxide, defined as total sulfur dioxide minus free sulfur dioxide.
#Drop the row-index column X
wine$X <- NULL
#Created "Bound Sulphur dioxide" variable
wine <- within(wine, bound.sulfur.dioxide <- total.sulfur.dioxide - free.sulfur.dioxide )
#Structure of Data
dim(wine)
## [1] 4898 13
str(wine)
## 'data.frame': 4898 obs. of 13 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ bound.sulfur.dioxide: num 125 118 67 139 139 67 106 125 118 101 ...
#summary of Data
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality bound.sulfur.dioxide
## Min. :3.000 Min. : 4.0
## 1st Qu.:5.000 1st Qu.: 78.0
## Median :6.000 Median :100.0
## Mean :5.878 Mean :103.1
## 3rd Qu.:6.000 3rd Qu.:125.0
## Max. :9.000 Max. :331.0
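As a quick sanity check on the derived variable, the subtraction can be reproduced by hand. The sketch below uses only the first three rows shown in the str() output above; the vector names are illustrative and not part of the original analysis.

```r
# First three rows copied from the str() output above.
total_so2 <- c(170, 132, 97)   # total.sulfur.dioxide
free_so2  <- c(45, 14, 30)     # free.sulfur.dioxide

# Bound sulfur dioxide is the part of the total that is not free.
bound_so2 <- total_so2 - free_so2
print(bound_so2)

# The derived column can never exceed the total it was computed from.
stopifnot(all(bound_so2 <= total_so2))
```

The result agrees with the first three values of bound.sulfur.dioxide in the str() output (125, 118, 67).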
Now, I will be performing Univariate, Bivariate and Multivariate analysis.
#Function to generate ggplots of some features
univ_cont <- function(feat) {
ggplot(data=wine, aes_string(x = feat)) + geom_histogram()
}
uni_va <- univ_cont("volatile.acidity")
uni_ph <- univ_cont("pH")
uni_den <- univ_cont("density")
uni_alc <- univ_cont("alcohol")
uni_sul <- univ_cont("sulphates")
#Histogram chart of volatile.acidity
plot(uni_va)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Histogram chart of volatile.acidity - outliers removed
ggplot(aes(x = volatile.acidity), data = wine)+
geom_histogram(binwidth = 0.01)+
coord_trans(y = 'sqrt')+
scale_x_continuous(limits = c(0.1,0.70), breaks = seq(0.1,0.70,0.1))
## Warning: Removed 24 rows containing non-finite values (stat_bin).
The distribution appears unimodal with the volatile acidity peaking around 0.28.
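The location of the peak can be checked against the frequency table printed later in this report by table(wine$volatile.acidity). The sketch below copies only the counts for the values 0.24 to 0.30 in steps of 0.01 from that output (the intermediate values such as 0.245 have very small counts and are left out); the object names are illustrative.

```r
# Counts copied from the table(wine$volatile.acidity) output in this report
# (only the values 0.24 to 0.30 in steps of 0.01 are reproduced here).
va_counts <- c("0.24" = 253, "0.25" = 231, "0.26" = 240, "0.27" = 218,
               "0.28" = 263, "0.29" = 160, "0.3" = 198)

# which.max() gives the position of the largest count;
# the matching name is the most frequent value.
mode_value <- as.numeric(names(va_counts)[which.max(va_counts)])
print(mode_value)
```

On the full data frame the same idea would be names(which.max(table(wine$volatile.acidity))).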
Is there any effect on the quality? What does this plot look like across the levels of quality?
#Bar chart of quality
ggplot(aes(x = quality), data = wine)+
geom_bar()+
scale_x_continuous(limits = c(0,10), breaks = seq(0,10,1))
The majority of white wines have a quality level of 5 or 6.
#Histogram of pH level
plot(uni_ph)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Histogram chart of pH level - outliers removed
ggplot(aes(x = pH), data = wine)+
geom_histogram(binwidth = 0.01)+
scale_x_continuous(limits = c(2.8,3.6), breaks = seq(3,3.6,0.05))
## Warning: Removed 49 rows containing non-finite values (stat_bin).
#Summary of pH level
summary(wine$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
table(wine$volatile.acidity)
##
## 0.08 0.085 0.09 0.1 0.105 0.11 0.115 0.12 0.125 0.13 0.135 0.14
## 4 1 1 6 6 13 3 34 3 44 1 56
## 0.145 0.15 0.155 0.16 0.165 0.17 0.175 0.18 0.185 0.19 0.2 0.205
## 4 88 5 141 2 140 1 177 5 170 214 4
## 0.21 0.215 0.22 0.225 0.23 0.235 0.24 0.245 0.25 0.255 0.26 0.265
## 191 1 229 4 216 4 253 4 231 10 240 5
## 0.27 0.275 0.28 0.285 0.29 0.295 0.3 0.305 0.31 0.315 0.32 0.325
## 218 3 263 5 160 3 198 4 148 4 182 2
## 0.33 0.335 0.34 0.345 0.35 0.355 0.36 0.365 0.37 0.375 0.38 0.385
## 134 7 135 9 86 1 104 2 65 2 63 2
## 0.39 0.395 0.4 0.405 0.41 0.415 0.42 0.425 0.43 0.435 0.44 0.445
## 61 2 59 1 54 4 36 2 35 2 46 4
## 0.45 0.455 0.46 0.47 0.475 0.48 0.485 0.49 0.495 0.5 0.51 0.52
## 25 2 30 15 3 17 3 14 2 14 10 10
## 0.53 0.54 0.545 0.55 0.555 0.56 0.57 0.58 0.585 0.59 0.595 0.6
## 8 10 1 14 2 9 4 7 2 4 2 7
## 0.61 0.615 0.62 0.63 0.64 0.65 0.655 0.66 0.67 0.68 0.685 0.69
## 7 4 5 2 7 2 3 4 5 3 1 2
## 0.695 0.705 0.71 0.73 0.74 0.75 0.76 0.78 0.785 0.815 0.85 0.905
## 3 2 1 1 1 1 2 1 1 1 1 1
## 0.91 0.93 0.965 1.005 1.1
## 1 1 1 1 1
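The pH distribution peaks around 3.14. The pH level is probably affected by acidity; the correlation matrix later in this report gives r = -0.426 between pH and fixed acidity. The minimum pH is 2.720 and the maximum is 3.820.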
#Histogram of density
plot(uni_den)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Histogram of density - outliers removed
ggplot(aes(x = density), data = wine)+
geom_histogram(binwidth = 0.0002)+
scale_x_continuous(limits = c(0.988,1.001), breaks = seq(0.988,1.001,0.001))+
coord_trans(y = 'sqrt')
## Warning: Removed 28 rows containing non-finite values (stat_bin).
#Summary data of density
summary(wine$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density has a very small range, from 0.9871 to 1.0390.
#Histogram of alcohol percentage
plot(uni_alc)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Histogram of alcohol percentage - outliers removed
ggplot(aes(x = alcohol), data = wine)+
geom_histogram(binwidth = 0.1)+
scale_x_continuous(limits = c(8.5,13.6), breaks = seq(8.5,13.6))+
coord_trans(y = 'sqrt')
## Warning: Removed 24 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
#Summary data of alcohol percentage
summary(wine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
There is a peak around 9.4, and the distribution is skewed to the right.
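A rough numeric check for right skew is to compare the mean with the median: a mean noticeably above the median hints at a longer right tail. This is a rule of thumb rather than a formal test. The sketch below uses values copied from the summary(wine) output earlier in this report; the data frame name is illustrative.

```r
# Mean and median copied from the summary(wine) output in this report.
stats <- data.frame(
  variable = c("alcohol", "residual.sugar", "pH"),
  mean     = c(10.51, 6.391, 3.188),
  median   = c(10.40, 5.200, 3.180)
)

# A mean clearly above the median hints at a longer right tail.
stats$mean_minus_median <- stats$mean - stats$median
print(stats)
```

All three differences are positive, with pH close to zero, which fits its nearly symmetric histogram.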
#Histogram of sulphates
plot(uni_sul)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Histogram of sulphates - outliers removed
ggplot(aes(x = sulphates), data = wine)+
geom_histogram(binwidth = 0.005)+
scale_x_continuous(limits = c(0.15,0.9), breaks = seq(0.15,0.9,0.05))+
coord_trans(y = 'sqrt')
## Warning: Removed 24 rows containing non-finite values (stat_bin).
table(wine$sulphates)
##
## 0.22 0.23 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37
## 1 1 4 4 13 13 16 31 35 54 59 84 85 120 129
## 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52
## 214 151 168 139 181 161 216 178 225 172 179 166 249 140 156
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67
## 135 167 102 108 83 99 97 88 45 68 48 67 28 36 35
## 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82
## 44 30 27 18 33 12 19 22 19 16 19 16 5 5 13
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.92 0.94 0.95 0.96 0.97 0.98 0.99
## 2 4 3 2 2 7 1 5 2 2 5 3 1 6 1
## 1 1.01 1.06 1.08
## 1 1 1 1
The distribution is skewed to the right. It appears slightly bi-modal, with the sulphate concentration peaking around 0.38 and again at 0.5, which is the most frequent value in the table above (249 wines).
The data frame consists of 4898 white wines described by 12 original variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality) plus 1 derived variable (bound sulfur dioxide). The wine ID column X was dropped. Quality is an ordinal score, stored as an integer, on the following scale; the scores observed in this dataset run from 3 to 9.
Quality: (Worst) 0, 1, ..., 9, 10 (Best)
Salient observations:
The main feature of interest in the data set is volatile acidity. I wanted to find out how volatile acidity increases or decreases with the quality of the white wine. I suspect pH and some combination of the other variables could be used to build a predictive model to grade white wines.
I would also like to see whether the amount of residual sugar increases the quality of the white wine, and whether there is any connection with the amount of alcohol in the wine.
A new variable named “bound.sulfur.dioxide” was created. It appears in the summary of the data frame and is used later in the bi-variate plots section.
I found that the alcohol percentage distribution was right skewed compared to the other variables that I investigated. Most of the white wines were below 13% alcohol. In most cases, I restricted the axis limits to exclude the outliers and get a better look at the bulk of the data.
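The outlier removal in the histograms comes from the limits passed to scale_x_continuous(), which drop rows outside the range before binning and produce the "Removed ... rows" warnings. As a sketch, the 24 rows reported for volatile acidity with limits c(0.1, 0.7) can be recovered from the table(wine$volatile.acidity) output in this report; the vector names below are illustrative.

```r
# Counts copied from the table(wine$volatile.acidity) output in this report:
# only values below 0.1 or above 0.7, i.e. outside the plot limits.
below <- c("0.08" = 4, "0.085" = 1, "0.09" = 1)
above <- c("0.705" = 2, "0.71" = 1, "0.73" = 1, "0.74" = 1, "0.75" = 1,
           "0.76" = 2, "0.78" = 1, "0.785" = 1, "0.815" = 1, "0.85" = 1,
           "0.905" = 1, "0.91" = 1, "0.93" = 1, "0.965" = 1, "1.005" = 1,
           "1.1" = 1)

# Total number of wines that fall outside the limits.
dropped <- sum(below) + sum(above)
print(dropped)
```

On the full data frame the equivalent count would be sum(wine$volatile.acidity < 0.1 | wine$volatile.acidity > 0.7).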
#Correlation matrix using pearson method
round(cor(wine, method = 'pearson'),3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 -0.023 0.289
## volatile.acidity -0.023 1.000 -0.149
## citric.acid 0.289 -0.149 1.000
## residual.sugar 0.089 0.064 0.094
## chlorides 0.023 0.071 0.114
## free.sulfur.dioxide -0.049 -0.097 0.094
## total.sulfur.dioxide 0.091 0.089 0.121
## density 0.265 0.027 0.150
## pH -0.426 -0.032 -0.164
## sulphates -0.017 -0.036 0.062
## alcohol -0.121 0.068 -0.076
## quality -0.114 -0.195 -0.009
## bound.sulfur.dioxide 0.136 0.157 0.102
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.089 0.023 -0.049
## volatile.acidity 0.064 0.071 -0.097
## citric.acid 0.094 0.114 0.094
## residual.sugar 1.000 0.089 0.299
## chlorides 0.089 1.000 0.101
## free.sulfur.dioxide 0.299 0.101 1.000
## total.sulfur.dioxide 0.401 0.199 0.616
## density 0.839 0.257 0.294
## pH -0.194 -0.090 -0.001
## sulphates -0.027 0.017 0.059
## alcohol -0.451 -0.360 -0.250
## quality -0.098 -0.210 0.008
## bound.sulfur.dioxide 0.345 0.194 0.264
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.091 0.265 -0.426 -0.017 -0.121
## volatile.acidity 0.089 0.027 -0.032 -0.036 0.068
## citric.acid 0.121 0.150 -0.164 0.062 -0.076
## residual.sugar 0.401 0.839 -0.194 -0.027 -0.451
## chlorides 0.199 0.257 -0.090 0.017 -0.360
## free.sulfur.dioxide 0.616 0.294 -0.001 0.059 -0.250
## total.sulfur.dioxide 1.000 0.530 0.002 0.135 -0.449
## density 0.530 1.000 -0.094 0.074 -0.780
## pH 0.002 -0.094 1.000 0.156 0.121
## sulphates 0.135 0.074 0.156 1.000 -0.017
## alcohol -0.449 -0.780 0.121 -0.017 1.000
## quality -0.175 -0.307 0.099 0.054 0.436
## bound.sulfur.dioxide 0.922 0.504 0.003 0.136 -0.427
## quality bound.sulfur.dioxide
## fixed.acidity -0.114 0.136
## volatile.acidity -0.195 0.157
## citric.acid -0.009 0.102
## residual.sugar -0.098 0.345
## chlorides -0.210 0.194
## free.sulfur.dioxide 0.008 0.264
## total.sulfur.dioxide -0.175 0.922
## density -0.307 0.504
## pH 0.099 0.003
## sulphates 0.054 0.136
## alcohol 0.436 -0.427
## quality 1.000 -0.218
## bound.sulfur.dioxide -0.218 1.000
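Rather than scanning the matrix by eye, the correlations with volatile acidity can be ranked by absolute value. The sketch below copies the volatile.acidity row from the output above into a named vector; the object names are illustrative.

```r
# Correlations with volatile.acidity, copied from the matrix above.
va_cor <- c(fixed.acidity = -0.023, citric.acid = -0.149,
            residual.sugar = 0.064, chlorides = 0.071,
            free.sulfur.dioxide = -0.097, total.sulfur.dioxide = 0.089,
            density = 0.027, pH = -0.032, sulphates = -0.036,
            alcohol = 0.068, quality = -0.195,
            bound.sulfur.dioxide = 0.157)

# Rank by absolute value so that the sign does not affect the ordering.
ranked <- va_cor[order(abs(va_cor), decreasing = TRUE)]
print(head(ranked, 3))
```

With the full matrix in memory, the same row could be taken from cor(wine)["volatile.acidity", ] after dropping the self-correlation.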
I noticed from the Pearson correlations above that the variables most strongly correlated with volatile acidity are quality and bound sulfur dioxide, with coefficients of -0.195 and 0.157, respectively. Both are weak in absolute terms. Let’s look at the visual representation of the correlations.
#Correlation plot
cm <- round(cor(wine, method = 'pearson'),3)
corrplot(cm, method = "circle")
The size and color of the circles show that volatile acidity correlates most strongly with quality, bound sulfur dioxide, and citric acid (r = -0.149), although all three correlations are weak. Thus, the next step is to make a bi-variate plot of volatile acidity against each of these three variables.
#Jitter Plot of citric.acid vs volatile acidity
vola <- ggplot(aes(x = citric.acid, y = volatile.acidity), data = wine)
vola + geom_jitter()
#Jitter Plot of citric.acid vs volatile acidity - outliers removed
vola + geom_jitter(alpha = 1/5)+
scale_x_continuous(limits = c(0,0.75), breaks = seq(0,0.75,0.05))+
geom_smooth()
## Warning: Removed 22 rows containing non-finite values (stat_smooth).
## Warning: Removed 31 rows containing missing values (geom_point).
Volatile acidity tends to decrease as citric acid increases, although the relationship is weak (r = -0.149). Could the citric acid have an effect on the taste of the white wine?
#Box Plot of quality vs volatile acidity
qua <- ggplot(aes(x = factor(quality), y = volatile.acidity), data = wine)
qua + geom_boxplot()+
geom_jitter(position=position_jitter(width=.1, height=0))
#Summary of quality vs volatile acidity
by(wine$volatile.acidity, wine$quality, summary)
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1700 0.2375 0.2600 0.3332 0.4125 0.6400
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1100 0.2700 0.3200 0.3812 0.4600 1.1000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.240 0.280 0.302 0.340 0.905
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2628 0.3200 0.7600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.2000 0.2600 0.2774 0.3300 0.6600
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.240 0.260 0.270 0.298 0.360 0.360
Wines of quality level 4 have the highest median (0.32) and mean (0.3812) volatile acidity of any level, which is consistent with high volatile acidity hurting the taste of the wine.
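To compare the quality levels side by side, the per-level statistics can be collected into one table. The sketch below copies the medians and means from the by() output above; the data frame and variable names are illustrative.

```r
# Median and mean volatile acidity per quality level,
# copied from the by() output above.
va_by_quality <- data.frame(
  quality = 3:9,
  median  = c(0.26, 0.32, 0.28, 0.25, 0.25, 0.26, 0.27),
  mean    = c(0.3332, 0.3812, 0.302, 0.2606, 0.2628, 0.2774, 0.298)
)

# Quality level with the highest typical volatile acidity.
worst_median <- va_by_quality$quality[which.max(va_by_quality$median)]
worst_mean   <- va_by_quality$quality[which.max(va_by_quality$mean)]
print(c(by_median = worst_median, by_mean = worst_mean))
```

The trend is not monotone: the medians for levels 6 to 9 all sit between 0.25 and 0.27. On the full data, tapply(wine$volatile.acidity, wine$quality, median) would produce the median column directly.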
#Jitter Plot of bound.sulfur.dioxide vs volatile acidity
bound <- ggplot(aes(x = bound.sulfur.dioxide, y = volatile.acidity), data = wine)
bound + geom_jitter()
#Jitter Plot of bound.sulfur.dioxide vs volatile acidity - outliers removed
bound + geom_jitter(alpha = 1/3)+
scale_y_continuous(limits = c(0.10,0.9))+
scale_x_continuous(limits = c(25,250), breaks = seq(25,250,10))+
geom_smooth()
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 38 rows containing missing values (geom_point).
Volatile acidity tends to increase slightly as bound sulfur dioxide increases (r = 0.157).
Let’s also look into alcohol against quality.
#Jitter Plot of alcohol vs quality
alc_qua <- ggplot(aes(x = quality, y = alcohol), data = wine)
alc_qua + geom_jitter()
#Jitter Plot of alcohol vs quality with transparency and a smoother
alc_qua <- ggplot(aes(x = jitter(quality), y = alcohol), data = wine)
alc_qua + geom_jitter(alpha = 1/3)+
scale_x_continuous(breaks = seq(0,10,1))+
geom_smooth()
Interestingly, we observe a trend: as the alcohol percentage increases, so does the quality.
#Jitter Plot of free.sulfur.dioxide vs bound.sulfur.dioxide
fb <- ggplot(aes(x = free.sulfur.dioxide, y = bound.sulfur.dioxide), data = wine)
fb + geom_jitter(alpha = 1/3)+
scale_x_continuous(limits = c(0,80), breaks = seq(0,80,10))+
geom_smooth()
## Warning: Removed 50 rows containing non-finite values (stat_smooth).
## Warning: Removed 50 rows containing missing values (geom_point).
The plot of bound against free sulfur dioxide shows a positive correlation.
Volatile acidity correlates most strongly with quality, bound sulfur dioxide, and citric acid, although all of these correlations are weak (absolute values below 0.2).
Volatile acidity decreases as citric acid increases, but the data were widely spread and showed only small clusters.
Overlaying jittered points on the box plot of volatile acidity against quality creates a good visual for comparing the different quality levels.
The plot of volatile acidity against bound sulfur dioxide did not offer a clear explanation because the data were widely spread, but it did show an increase in volatile acidity at high levels of bound sulfur dioxide.
The new variable that I created, bound sulfur dioxide, shows a positive correlation with free sulfur dioxide (r = 0.264). Also, alcohol against quality showed that as the alcohol percentage increases, so does the quality (r = 0.436).
Volatile acidity showed a negative correlation with quality (r = -0.195): quality tended to be higher in wines with lower volatile acidity.
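The coefficients quoted in this section are Pearson correlations, as computed by cor(wine, method = 'pearson'). As a minimal sketch of what that number measures, the example below computes Pearson's r by hand on made-up values (not taken from the wine data) and compares it with the built-in function.

```r
# Made-up paired measurements for illustration; not taken from the wine data.
x <- c(8.8, 9.5, 10.1, 11.0, 12.4, 13.0)
y <- c(5, 5, 6, 6, 7, 8)

# Pearson's r: the sum of cross-products of deviations from the means,
# divided by the square root of the product of the sums of squared deviations.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Should agree with the built-in cor() used in the report.
print(c(manual = r_manual, builtin = cor(x, y, method = "pearson")))
```

Because r only captures linear association, a weak coefficient such as those found for volatile acidity does not rule out a non-linear relationship, which is one reason the scatter plots are still worth inspecting.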
#Scatter plot of citric.acid vs volatile acidity colored by quality
gpfq <- geom_point(aes(color = factor(quality)))
vola + gpfq + scale_color_brewer(palette = "Reds")+
theme_dark()
The volatile acidity plot elaborates on the odd trends that were seen in the box plots earlier. Most wines of quality level 6 and above do not exceed 0.75 volatile acidity.
#Scatter plot of bound.sulfur.dioxide vs volatile acidity colored by quality
bound + gpfq+
scale_y_continuous(limits = c(0.10,0.9))+
scale_x_continuous(limits = c(25,250), breaks = seq(25,250,20))+
scale_color_brewer(palette = "Greens")+
theme_dark()
## Warning: Removed 35 rows containing missing values (geom_point).
The different quality levels are widely spread, but there does seem to be a large grouping between 45 and 170 mg/dm^3 of bound sulfur dioxide.
The plot of volatile acidity against citric acid showed a negative relationship that became easier to see as the quality of the white wine increased.
Surprisingly, higher-quality wines tend to have lower bound sulfur dioxide, which can be seen from the different shades of green in the plot.
No predictive model was built for this analysis.
ggplot(aes(x = volatile.acidity), data = wine)+
geom_histogram(binwidth = 0.01)+
scale_x_continuous(limits = c(0.1,0.7), breaks = seq(0.1,0.7,0.1))+
labs(title = "Volatile Acidity in White Wine", x = "Volatile Acidity (g/dm^3)", y = "Count of White Wines")
## Warning: Removed 24 rows containing non-finite values (stat_bin).
The distribution of volatile acidity appears to be unimodal, with a curious spike around 0.28.
alc_qua + geom_jitter(alpha = 1/3)+
scale_x_continuous(breaks = seq(0,10,1))+
geom_smooth()+
labs(title = "Alcohol Content by Quality in White Wine", x = "Quality (0 to 10)", y = "Alcohol (%)")
This plot confirms that the alcohol percentage tends to increase with the quality level of the white wine.
vola + gpfq +
scale_color_brewer(palette = "Blues")+
theme_dark()+
labs(title = "Volatile Acidity vs Citric Acid by Quality in White Wine", x = "Citric Acid (g/dm^3)", y = "Volatile Acidity (g/dm^3)", colour = "Quality of Wine")
The quality of wine increases as we move towards the lower right of the plot. Wine seems to have better quality when citric acid is around 0.15 and volatile acidity is 0.3.
This data set contains information on 4,898 different white wines from a 2009 study. My goal was to find which chemical properties affected the volatile acidity in the white wine. I started out by exploring the distributions of individual variables and looked for unusual behavior in the histograms. I then calculated and plotted the correlations between volatile acidity and the other variables. None of these correlations exceeded 0.5 in absolute value; the largest was -0.195, with quality. Among the chemical properties, the two with relatively strong correlations were citric acid and bound sulfur dioxide, but the individual correlations were not strong enough to draw definitive conclusions with bi-variate analysis alone. However, the multivariate plot shown as Final Plot 3 showed an increase in quality at certain citric acid values. One suggestion for this data set is to include storage time and storage method, since these factors can influence the quality of wine as well. Further studies might examine the relationship between the price and quality of wine to investigate whether more expensive wines are of better quality.